SPATIAL REGRESSION AND THE BAYESIAN FILTER
JOHN HUGHES
DEPARTMENT OF BIOSTATISTICS AND INFORMATICS
UNIVERSITY OF COLORADO DENVER
Abstract. Regression for spatially dependent outcomes poses many challenges,
for inference and for computation. Non-spatial models and traditional spatial
mixed-effects models each have their advantages and disadvantages, making it
difficult for practitioners to determine how to carry out a spatial regression
analysis. We discuss the data-generating mechanisms implicitly assumed by
various popular spatial regression models, and discuss the implications of
these assumptions. We propose Bayesian spatial filtering as an approximate
middle way between non-spatial models and traditional spatial mixed models. We
show by simulation that our Bayesian spatial filtering model has several
desirable properties and hence may be a useful addition to a spatial
statistician’s toolkit.
1. Introduction
Spatially referenced data arise in sundry ﬁelds of inquiry, e.g., radiology, neu-
roscience, epidemiology, marketing, ecology, agriculture, forestry, geography, and
climatology. Because spatial data tend to exhibit spatial dependence (usually at-
tractive but sometimes repulsive or even a combination of the two), a number of
statistical models, collectively referred to as spatial models, have been developed
for analyzing such data (Banerjee et al., 2014).
Since dependence is customar-
ily considered to be a second-moment phenomenon, nearly all spatial models are
second-moment models.
In fact, second-moment methods so dominate the ﬁeld
that allowing “second-moment” to be a deﬁning characteristic of spatial models
would not be unreasonable. Here we revisit this important assumption, and discuss
what the assumption implies regarding the data-generating process. Our goals are
to (i) provide an appreciation of the assumptions underpinning our models, and (ii)
understand how these assumptions may impact the results of a spatial regression
analysis.
Often, the aim of a spatial analysis is to do inference regarding the eﬀects
β = (β1, . . . , βp)′ of a number of spatially structured covariates X = (x1 · · · xp).
By accounting for spatial dependence in excess of that explained by Xβ, it is
claimed, spatial regression models permit more reliable inference for β, and bet-
ter prediction, than do non-spatial models.
But whether a given spatial model
yields improved regression inference and/or prediction depends on the posited data-
generating mechanism (i.e., the “true” model from which the data arose) as well as
the properties of said spatial model.
The rest of this manuscript is organized as follows. In Section 2 we review the
class of spatial models and discuss them as data-generating mechanisms. In
Section 3 we discuss how our modeling assumptions impact spatial regression
inference and prediction. In Section 4 we discuss computing for spatial
regression. In Section 5 we apply six regression models to simulated outcomes
in an effort to assess their performance in a challenging, but realistic,
setting informed by the discussion in Sections 2 and 3. We develop Bayesian
spatial filtering, a new approach to spatial regression, in Section 6. We then
conclude in Section 7.
arXiv:1706.04651v2  [stat.ME]  31 Jul 2017
2. Spatial Data: Ontology versus Phenomenology
In this section we will examine spatial models as data-generating mechanisms.
We begin by reviewing the most commonly applied spatial regression models—
partly to introduce useful notation, and partly to highlight the models’ second-
order components. Then we will discuss what sort of generating mechanism we are
assuming when we apply each of these models.
2.1. A Brief Review of Spatial Regression Models. Let Z = (Z1, . . . , Zn)′
be the response vector, where Zi is observed at spatial location si. If said locations
are points residing in a continuous spatial domain (e.g., a Borel subset of ℝ² or
near the surface of a biaxial ellipsoid), the outcomes are said to be point-level or
geostatistical. If si instead refers to an area over which measurements have been
aggregated (e.g., county, voxel, Census tract) to produce Zi, the outcomes are said
to be areal.
Along with Z we have p covariates x1, . . . , xp, where xj = (x1j, . . . , xnj)′ and
xij, like Zi, was measured at spatial location si. Presumably, each of x1, . . . , xp is
spatially structured and so may be useful for explaining a signiﬁcant portion of the
spatial variation exhibited by Z.
It is often the case that Z exhibits additional spatial structure, i.e., spatial
structure that cannot be explained by Xβ alone.
The most common means of
accounting for this additional structure is to augment the linear predictor Xβ with
spatially dependent random eﬀects.
This leads to the spatial generalized linear
mixed model (SGLMM), for which the transformed conditional mean vector is
given by
g(µ) = Xβ + ψ,
(1)
where g(µ) = ⟨g(µ1), . . . , g(µn)⟩, g is a link function, µi = E(Zi | ψi), and
ψ = (ψ1, . . . , ψn)′ are latent spatially dependent random eﬀects. Conditional on
ψ, the outcomes are assumed to be independent draws from a suitable distribution
(common choices are binomial, Gaussian, and Poisson). Whether the spatial do-
main is continuous (Diggle et al., 1998) or discrete (Besag et al., 1991), the spatial
random eﬀects are nearly always assumed to be multinormal with mean 0 (Haran,
2011), and so variants of the SGLMM are distinguished by alternative speciﬁca-
tions of ψ’s covariance matrix Σ, which is usually structured to accommodate (or
induce) spatial clustering.
For areal data, spatial proximity is defined in terms of an undirected n-graph
G = (V, E), where V = {1, . . . , n} are the vertices and E ⊂ V × V are the edges.
The vertices of G represent the areal units, and the edges of G represent adjacencies
among the units (usually, a pair of vertices share an edge iff their corresponding
areal units share a boundary). In this setting Σ is typically a function of G’s
adjacency matrix, A = (Auv = 1{(u, v) ∈ E}), and perhaps one or more dependence
parameters. A famous possibility is the proper conditional autoregressive (CAR)
model, in which Σ is equal to (τQ)⁻¹, where τ > 0 is a smoothing parameter
and Q = diag(A1) − ρA, with ρ ∈ [0, 1) behaving like a range parameter. This
implies that ψ is a Gaussian Markov random field (GMRF) (Rue and Held, 2005),
which implies that ψu and ψv are independent conditional on their neighbors iff
areal units u and v are not adjacent. That G’s adjacency structure corresponds to
a conditional independency structure for ψ is widely considered to be an appealing
characteristic of this and similar definitions of Σ. Unfortunately, the resulting
marginal dependence structure for ψ may be counterintuitive or even pathological
(Wall, 2004; Assunção and Krainski, 2009).
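The proper CAR construction can be sketched numerically. The following is a minimal illustration, assuming a hypothetical path graph on four areal units and arbitrary values of τ and ρ (none of which come from the text); the helper name `car_precision` is likewise an assumption. It shows that the precision matrix is zero for non-adjacent units while the implied covariance is not, and that the marginal variances differ across units.

```python
import numpy as np

def car_precision(A, rho):
    """Return Q = diag(A1) - rho * A for a symmetric 0/1 adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - rho * A

# Hypothetical example: a path graph on four areal units, 1 - 2 - 3 - 4.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
tau, rho = 2.0, 0.9               # illustrative values only
Q = car_precision(A, rho)
Sigma = np.linalg.inv(tau * Q)    # covariance matrix of psi

print(Q[0, 2])         # zero: units 1 and 3 are not adjacent
print(Sigma[0, 2])     # nonzero: they are nevertheless marginally dependent
print(np.diag(Sigma))  # marginal variances are not constant across units
```

The unequal marginal variances are one simple instance of the marginal behavior that the conditional specification does not make obvious.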
For point-level observations, the elements of Σ are given by a spatial covariance
function: Σuv = k(su, sv). A common choice for k is the Matérn covariance
function, which is given by
$$
k(s_u, s_v) = k_{\sigma,\nu,\rho}(\|s_u - s_v\|) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|s_u - s_v\|}{\rho}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,\|s_u - s_v\|}{\rho}\right),
$$
where ∥su − sv∥ is the distance between su and sv, σ² is the common variance,
ν > 0 is a smoothness parameter, Γ denotes the gamma function, ρ > 0 is a range
parameter (often referred to as the characteristic length scale), and Kν is the
modified Bessel function of the second kind. This defines a Gaussian process
(Rasmussen and Williams, 2006). Since k depends only on distances between
locations, the process is stationary, i.e., translation invariant. If the norm is
the Euclidean norm, the process is also isotropic, which is to say that the
variability is the same in all directions.
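As a concrete check of the Matérn formula, here is a minimal sketch that evaluates it with SciPy's gamma function and modified Bessel function of the second kind. The function name `matern` and the parameter values are illustrative assumptions. For ν = 1/2 the formula, in this parameterization, reduces to the exponential covariance σ² exp(−d/ρ), which the example verifies numerically.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, sigma2=1.0, nu=0.5, rho=1.0):
    """Matern covariance at distances d, in the parameterization given above."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    scaled = np.sqrt(2.0 * nu) * d / rho
    out = np.full(d.shape, float(sigma2))   # the limit as d -> 0 is sigma2
    pos = scaled > 0
    out[pos] = (sigma2 * 2.0 ** (1.0 - nu) / gamma(nu)
                * scaled[pos] ** nu * kv(nu, scaled[pos]))
    return out

d = np.array([0.0, 0.5, 1.0, 2.0])
vals = matern(d, sigma2=2.0, nu=0.5, rho=1.5)
# With nu = 1/2 the Matern covariance is the exponential covariance.
print(np.allclose(vals, 2.0 * np.exp(-d / 1.5)))
```

Zero distances are handled separately because the Bessel function diverges at the origin even though the product has the finite limit σ².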
A second approach to accommodating/inducing extra-Xβ spatial structure in
areal outcomes is to augment Xβ with an autocovariate in place of the SGLMM’s
random effects, in which case the linear predictor is given by
g(µ) = Xβ + κA{Z − g⁻¹(Xβ)},    (2)
where µi = E(Zi | {Zj : (i, j) ∈ E}) and dependence parameter κ captures the
“reactivity” of the outcomes to their neighbors, conditional on the independence
expectations E(Z | κ = 0) = g⁻¹(Xβ). (A positive value of κ implies spatial
attraction while a negative value implies repulsion, and larger |κ| produces/indicates
stronger dependence.) This defines the automodel (Besag, 1974), a type of Markov
random field (MRF) model (Kindermann and Snell, 1980; Clifford, 1990). The
proper CAR model described above is a special case. Another noteworthy example
is the autologistic model (Caragea and Kaiser, 2009; Hughes et al., 2011) for binary
data, for which (2) takes the form
logit(µ) = Xβ + κA{Z − ζ},
where
ζ = {1 + exp(−Xβ)}⁻¹,
or, more explicitly,
$$
\log \frac{P(Z_i = 1 \mid \{Z_j : (i, j) \in E\})}{P(Z_i = 0 \mid \{Z_j : (i, j) \in E\})} = x_i'\beta + \kappa \sum_{j : (i, j) \in E} \left[Z_j - \{1 + \exp(-x_j'\beta)\}^{-1}\right],
$$
for i = 1, . . . , n.
A third, and newer, type of spatial regression model is the spatial copula regres-
sion model (SCRM) (Kazianka and Pilz, 2010; Hughes, 2015). Unlike the SGLMM
and automodel, the SCRM is a marginal model, which is to say the regression
coefficients have the same interpretation as in the classical GLM (McCullagh and
Nelder, 1983). A common choice for the joint component of the SCRM is the
spatial Gaussian copula
$$
\Phi_{0,R}\{\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_n)\},
$$
where the ui are standard uniform, Φ_{0,R} denotes the cdf of the multinormal
distribution with mean vector 0 and spatial correlation matrix R, and Φ⁻¹ is the
standard normal quantile function. See Joe (2014) for an extensive treatment of
copula models, and Kolev and Paiva (2009) for a review of copula-based regression
models.
The copula can be applied to the outcomes directly, or be employed in a
hierarchical fashion. The gamma–Poisson model provides an intuitive example of
the latter:
$$
\begin{aligned}
Z_i \mid \lambda_i &\overset{\text{ind}}{\sim} P(\lambda_i) \\
\lambda_i &\sim G(\nu\mu_i, \nu) \\
\{\psi_i = \Phi^{-1}\{F_i(\lambda_i)\}\}_{i=1}^{n} &\sim N(0, R),
\end{aligned}
$$
where P denotes the Poisson distribution, G denotes the gamma distribution,
µi = g⁻¹(xi′β), and Fi is the G(νµi, ν) cdf. In this formulation the copula is
applied to the λi (which are marginally gamma and exhibit Gaussian dependence),
and so the outcomes are dependent because the λi are dependent.
Two additional spatial regression models are the simultaneous autoregressive
model (Cressie, 1993) and the clipped random ﬁeld (De Oliveira, 2000). Although
interesting, these models are not applied as often as the models described above,
and so, in the interest of brevity, we will not consider them further in this work.
2.2. Interpreting Spatial Regression Models. What do the above-mentioned
models—the SGLMM, the automodel, and the SCRM—mean if we attempt to grant
ontological status to their second-order components? This is clearly not an issue for
X since we are in possession of it and believe it to be more fundamental than the
outcomes (in the sense that much of the spatial variation exhibited by the response
can be attributed to X). What we seek are equally fundamental interpretations of
the models’ dependence components.
Let us ﬁrst consider the SGLMM, which induces extra-Xβ spatial variation by
augmenting the classical linear predictor with spatially dependent random eﬀects
ψ. To what aspect of reality does ψ refer? A prima facie interpretation of ψ would
lead us to conclude that ψ is an unobservable realization of some spatial process
(just as each column of X is an observable realization of some spatial process) and
that said process acts on the outcomes on the link scale and in an additive fashion.
But this interpretation of ψ does not explain the extra-Xβ spatial variation in the
outcomes. This interpretation merely accommodates, i.e., reveals the pattern of,
that additional variation but cannot describe its origin. That is, this apparently
ontological interpretation of ψ is, in fact, phenomenological—is, in fact, no more
fundamental than the outcomes themselves.
It is perhaps just as difficult to tie the automodel’s autocovariate term
κA{Z − g⁻¹(Xβ)} to (non-mathematical) reality. Since the autocovariate, unlike ψ,
involves Xβ, one might argue that the autocovariate is more fundamental than ψ.
But the autocovariate is also self-referential, i.e., it contains the response we
aim to explain. And so it is not clear how one might arrive at a sensible realist
interpretation of the autocovariate term. The term does admit an intuitive
phenomenological interpretation, however: for the automodel, extra-Xβ spatial
variation is defined, quite explicitly, as localized departures from the
independence expectations g⁻¹(Xβ). We might attach this same interpretation to the
SGLMM, although there the mechanism of departure from the independence
expectations is less explicit and is not self-referential.
The copula-based model, whether the copula is applied directly or hierarchically,
is a rather diﬀerent sort of model since it does not induce/accommodate extra-Xβ
spatial variation on the scale of the link function. Instead, the copula acts by way
of quantile transformations. To see this, consider the stochastic form of the copula
model, where we apply the copula to the outcomes (in contrast to the hierarchical
formulation given above):
$$
\begin{aligned}
\{\psi_i\}_{i=1}^{n} &\sim N(0, R) \\
U_i = \Phi(\psi_i) &\sim U(0, 1) \\
Z_i = F_i^{-1}(U_i) &\sim P(\lambda_i),
\end{aligned}
$$
where Fi⁻¹ is the quantile function of the Poisson distribution with mean
λi = g⁻¹(xi′β). Here, extra-Xβ spatial variation originates in the ψi, carries over
to the Ui (which are marginally standard uniform and exhibit Gaussian dependence),
and finally influences the outcomes through the quantile transformations Fi⁻¹
(which also incorporate Xβ). That is, the copula does not induce extra-Xβ variation
by additively perturbing Xβ (or perturbing the λi in any fashion) but instead pushes
the Zi away from the λi by inducing a spatial pattern among the Ui.
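The three-step stochastic form of the direct copula model translates almost line for line into code. The sketch below is only illustrative: the correlation matrix (0.8 raised to the distance between five locations on a line), the Poisson means, and the function name `simulate_copula_poisson` are all assumptions, not quantities from the text. It draws many realizations and checks that the marginal means stay close to the λi while neighboring outcomes are positively correlated.

```python
import numpy as np
from scipy.stats import norm, poisson

def simulate_copula_poisson(lam, R, size, rng):
    """Draw `size` vectors Z with Poisson(lam_i) marginals and Gaussian copula R."""
    L = np.linalg.cholesky(R)
    psi = rng.standard_normal((size, len(lam))) @ L.T   # rows are N(0, R)
    U = norm.cdf(psi)                                    # marginally U(0, 1)
    U = np.clip(U, 1e-12, 1.0 - 1e-12)                   # avoid infinite quantiles
    return poisson.ppf(U, lam).astype(int)               # Z_i = F_i^{-1}(U_i)

rng = np.random.default_rng(0)
idx = np.arange(5)
R = 0.8 ** np.abs(idx[:, None] - idx[None, :])   # assumed correlation matrix
lam = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # assumed values of g^{-1}(x_i' beta)

Z = simulate_copula_poisson(lam, R, size=20000, rng=rng)
print(Z.mean(axis=0))                         # close to lam
print(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])    # positive dependence between neighbors
```

Note that the λi enter only through the quantile functions; nothing is added to the linear predictor.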
Does the copula represent some real-world mechanism? The answer must be no
since ψ in the copula model serves precisely the same role, conceptually, as does ψ
in the SGLMM. Both models can be viewed as latent Gaussian models, and what
distinguishes them is merely the way in which the latent Gaussian random variable
ψ obscures g−1(Xβ).
And so it appears that the dependence components of commonly applied spatial
regression models do not lend themselves to realist interpretations but are instead
merely instrumental. The dependence components of these models may be capable
of generating what we have termed extra-Xβ spatial variation, but the models are
unable to explain spatial variation in the response in the same sense that Xβ can.
2.3. Extra-Xβ Spatial Variation as the Result of Model Underspecification.
Model underspecification offers a plausible realist explanation for extra-Xβ
spatial variation. Specifically, we might suppose that
g(µ) = Xβ + 𝒳γ,    (3)
where the columns of 𝒳 are unmeasured spatial predictors and γ their effects. This
implies that extra-Xβ spatial variation is a first-moment phenomenon, i.e., the
spatial dependence among the outcomes is due entirely to spatial structure among
the measured predictors X and the unmeasured predictors 𝒳. This view demystifies
the spatial regression problem and allows us to analyze the problem using intuitive
and well-understood ideas regarding ordinary regression modeling (i.e., regression
modeling for independent outcomes).

3. Spatial Regression Models as Data-Analytic Tools
In the setting of ordinary regression, consider four possibilities for a given model:
(A) the model is correct;
(B) the model is underspeciﬁed, i.e., one or more important predictors is miss-
ing;
(C) the model is overspeciﬁed, i.e., one or more predictors is redundant; or
(D) the model contains extraneous predictors, i.e., one or more predictors is not
related to the response or to any other predictor.
If the true model is linear with spherical Gaussian errors, say,
(A) permits unbiased estimation of the regression coeﬃcients and unbiased pre-
diction, and yields accurate standard errors;
(B) permits unbiased estimation of β only if X is not correlated with the
unmeasured predictors 𝒳, and leads to biased prediction and inflated standard errors;
(C) permits unbiased estimation of the regression coeﬃcients and unbiased pre-
diction, but standard errors may be inﬂated dramatically due to collinear-
ity; and
(D) permits unbiased estimation of the regression coeﬃcients and unbiased pre-
diction, but standard errors may be inﬂated dramatically if the number of
extraneous predictors is large.
We mentioned in Section 1 that employing a spatial regression model to account
for extra-Xβ spatial variation in the response can allegedly permit more reliable
inference for β than a non-spatial model can. Assuming (3), and in light of (B), this
claim implies that some spatial model(s) can remedy the absence of the unmeasured
predictors 𝒳, resulting in (i) more accurate estimation of β, better (ii) coverage
and (iii) type II error rates, and (iv) more accurate prediction. Can any spatial
regression model accomplish all of these tasks? That is, if the data-generating
mechanism is (3), can any spatial regression model, when employed not as
data-generating mechanism but as data-analytic tool, accomplish (i–iv)?
Regarding (i), estimation of β will be biased, perhaps badly so, unless the
unmeasured predictors 𝒳 are not correlated with the measured predictors X. Some
spatial models may be able to provide a surrogate for 𝒳γ, but that is not the same
as revealing 𝒳, for it is the relationship between X and 𝒳, not the structure of
𝒳γ, that matters when estimating β. In other words, no spatial model can remedy
unmeasured confounding.
The absence of 𝒳 need not lead to poor prediction, however. Recall that the
SGLMM and the automodel augment Xβ with, respectively, spatial random effects
ψ or the autocovariate κA{Z − g⁻¹(Xβ)}. Presumably, each of these terms aids
prediction by acting as a surrogate for 𝒳γ. The SCRM (in the form described
above, at least) does not augment Xβ, and so we should expect that model to offer
poorer predictive performance than the SGLMM and automodel.
Although the SGLMM oﬀers better prediction than a non-spatial model or a
copula-based model, the improvement is costly. To see this, it will prove useful to
rewrite the SGLMM’s linear predictor as
g(µ) = Xβ + Pxψ + (I − Px)ψ,    (4)
where Px = X(X′X)⁻¹X′ is the orthogonal projection onto C(X), and I denotes
the n × n identity matrix. This form of the linear predictor allows us to see that
the SGLMM is overspecified as well as underspecified: since C(Px) = C(X), the
model is perfectly collinear. This trait of the SGLMM, which inflates the variance
of β̂, often dramatically, as per (C) above, is called spatial confounding (Clayton
et al., 1993; Reich et al., 2006; Paciorek, 2010; Hodges and Reich, 2010).
The confounding evident in (4) can be eliminated by removing Pxψ, thereby con-
straining smoothing to the residual space C(X)⊥. This technique is called restricted
spatial regression (RSR) (Hodges and Reich, 2010). RSR not only obviates spatial
confounding but can also permit considerable dimension reduction and much more
time- and space-eﬃcient computation (Hughes and Haran, 2013; Hughes, 2014).
Hanks et al. (2015) acknowledged the potential computational beneﬁts of RSR
but cautioned that RSR may lead to erroneous inference for β if (1) is the true
model. According to Hanks et al. (2015), the RSR model, which has linear predictor
g(µ) = Xβ + (I −Px)ψ,
(5)
implicitly assumes that all variation in the direction of X can be explained by
Xβ, whereas the traditional SGLMM can accommodate additional variation in the
direction of X.
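To support the latter claim they rewrite (4) as
$$
\begin{aligned}
g(\mu) &= X\beta + P_x\psi + (I - P_x)\psi \qquad (6) \\
&= X\beta + X(X'X)^{-1}X'\psi + (I - P_x)\psi \\
&= X\{\beta + (X'X)^{-1}X'\psi\} + (I - P_x)\psi \\
&= X\delta + (I - P_x)\psi. \qquad (\dagger)
\end{aligned}
$$
Similarly, we can rewrite our posited data-generating model (3), in which 𝒳
denotes the unmeasured predictors, as
$$
\begin{aligned}
g(\mu) &= X\beta + P_x\mathcal{X}\gamma + (I - P_x)\mathcal{X}\gamma \qquad (7) \\
&= X\beta + X(X'X)^{-1}X'\mathcal{X}\gamma + (I - P_x)\mathcal{X}\gamma \\
&= X\{\beta + (X'X)^{-1}X'\mathcal{X}\gamma\} + (I - P_x)\mathcal{X}\gamma \\
&= X\delta + (I - P_x)\mathcal{X}\gamma \qquad (\ddagger)
\end{aligned}
$$
to show that (3) can generate additional variation in the direction of X. Hence,
(6) and (7) show that the RSR model (lines † and ‡) can, in fact must, accommodate
extra variation in the direction of X. That is, when we fit an RSR model, we are
estimating δ, not β, and this is true whether the “true” linear predictor is Xβ + ψ
or Xβ + 𝒳γ (assuming X is correlated with 𝒳).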
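The algebra in (7) can be verified numerically. The following is a small sketch under assumed, purely illustrative values: the unmeasured predictors are stored in a hypothetical array `X_un` that is constructed to be correlated with a measured covariate, and the coefficient values are arbitrary. It confirms that the two forms of the linear predictor agree, and that regressing the linear predictor on the measured predictors alone recovers δ rather than β.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])    # measured predictors
# Unmeasured predictor, built to be correlated with the second column of X.
X_un = (0.7 * X[:, 1] + rng.standard_normal(n)).reshape(-1, 1)
beta = np.array([1.0, 2.0])     # assumed effects of the measured predictors
gamma = np.array([1.5])         # assumed effect of the unmeasured predictor

B = np.linalg.solve(X.T @ X, X.T)    # (X'X)^{-1} X'
P = X @ B                            # orthogonal projection onto C(X)
delta = beta + B @ X_un @ gamma      # delta = beta + (X'X)^{-1} X' X_un gamma

eta = X @ beta + X_un @ gamma                          # first line of (7)
eta_rsr = X @ delta + (np.eye(n) - P) @ X_un @ gamma   # last line of (7)
fit = np.linalg.lstsq(X, eta, rcond=None)[0]           # regress eta on X alone

print(np.allclose(eta, eta_rsr))   # the two forms agree
print(beta, delta, fit)            # the fit matches delta, not beta
```

If `X_un` were constructed to be orthogonal to the columns of X, δ would coincide with β, which is the uncorrelated case noted in the text.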
In any case, the crux of the matter is the absence of the unmeasured predictors 𝒳.
It is the absence of 𝒳 that prevents accurate estimation of β (if X is correlated
with 𝒳), and neither the traditional SGLMM nor the RSR model provides a remedy.
What both models do provide is more accurate prediction (by furnishing a stand-in
for 𝒳γ). The traditional SGLMM accomplishes this at the cost of spatial confounding
and a large (with respect to both time and storage) computational burden. RSR
successfully addresses these problems and, if applied properly, yields significantly
better predictive performance than the traditional model (see Section 5 below).
Although unmeasured spatial confounding cannot be remedied (in general, or
entirely, at least), Hanks et al. (2015) suggest another avenue by which inference
for β might be improved. They note that the RSR model may suﬀer from a low
coverage rate for β, and they recommend the larger credible region that results
from posterior predictive inference (Gelman et al., 2013) according to
$$
\tilde{\beta}^{(k)} \sim N\{\delta^{(k)}, (X'X)^{-1}X'\Sigma^{(k)}X(X'X)^{-1}\},
$$
where δ^(k) is the kth sample from δ’s posterior, and Σ^(k) = Σ(ξ^(k)) is the value of
Σ produced from the kth update of the covariance parameters ξ^(k). In Section 5
we study how this approach performs in practice.
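One draw from this posterior predictive distribution might be computed as follows. This is a minimal sketch, not the procedure of Hanks et al. (2015) as implemented anywhere: the function name `draw_beta_tilde`, the design matrix, the exponential form of Σ^(k), and the value of δ^(k) are all hypothetical stand-ins for quantities that would come from an MCMC run.

```python
import numpy as np

def draw_beta_tilde(delta_k, X, Sigma_k, rng):
    """Draw from N{delta_k, (X'X)^{-1} X' Sigma_k X (X'X)^{-1}}; also return the covariance."""
    B = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'
    cov = B @ Sigma_k @ B.T             # B.T equals X (X'X)^{-1}
    cov = (cov + cov.T) / 2.0           # symmetrize against round-off
    return rng.multivariate_normal(delta_k, cov), cov

# Hypothetical inputs standing in for the kth MCMC iteration.
rng = np.random.default_rng(1)
n = 30
s = np.arange(n)
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])
Sigma_k = np.exp(-np.abs(s[:, None] - s[None, :]) / 3.0)   # assumed covariance
delta_k = np.array([0.5, -1.0])                            # assumed posterior sample

beta_tilde, cov = draw_beta_tilde(delta_k, X, Sigma_k, rng)
print(beta_tilde)
print(cov)
```

Repeating the draw for every retained MCMC iteration yields a sample whose spread exceeds that of the δ^(k) alone, which is the source of the larger credible region.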
The spatial confounding caused by adding ψ to Xβ may lead us to suspect that
the automodel, which adds κA{Z − g⁻¹(Xβ)} to Xβ, is likewise confounded. This
is, in fact, the case for the traditional automodel, which has linear predictor
g(µ) = Xβ + κAZ.
Caragea and Kaiser (2009) studied this problem in the context of the autologistic
model and showed that centering the autocovariate alleviates spatial confounding
for the automodel: A{Z − g⁻¹(Xβ)} is to AZ as (I − Px)ψ is to ψ.
Since the SCRM does not augment Xβ, the SCRM is not spatially confounded.
But the SCRM has no way of ﬁtting extra-Xβ spatial variation, and so we should
expect the SCRM’s predictive performance to be no better than that of the ordinary
GLM.
4. Some Computational Aspects of Spatial Regression
Now we turn our attention to computational issues involved in spatial regression.
This topic could easily ﬁll a book, and so our goal is not to provide a thorough
treatment.
We aim to describe only the most important aspects of computing
for spatial regression, and, in so doing, to set the stage for the simulation study
that is the subject of Section 5. We will focus on models for binary areal data,
for four reasons: (1) binary spatial data are common; (2) binary outcomes, being
relatively uninformative, present the most challenging case; (3) the automodel is
an areal model; and (4) although spatial counts are common, the auto-Poisson
and autonegative binomial models permit only negative spatial dependence. (This
limitation of the auto-Poisson and autonegative binomial models can be overcome
through Winsorization (Kaiser and Cressie, 1997), but the resulting models are,
perhaps surprisingly, not often applied.)
4.1. Computing for the Autologistic Model. Maximum likelihood and Bayesian
inference for the autologistic model are complicated by an intractable normalizing
function. To see this, assume the underlying graph has clique number 2, in which
case the joint pmf of the centered model is
$$
\pi(Z \mid \theta) = c(\theta)^{-1} \exp\left(Z'X\beta - \kappa Z'A\zeta + \frac{\kappa}{2} Z'AZ\right),
$$
where θ = (β′, κ)′ and
$$
c(\theta) = \sum_{Y \in \{0,1\}^n} \exp\left(Y'X\beta - \kappa Y'A\zeta + \frac{\kappa}{2} Y'AY\right)
$$
is the normalizing function (Hughes et al., 2011). The normalizing function is
intractable for all but the smallest datasets because the sample space {0, 1}ⁿ
contains 2ⁿ points.
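To make the intractability concrete, the normalizing function can be computed by brute force for a very small graph. The sketch below assumes a hypothetical four-node cycle (which has clique number 2), an arbitrary design matrix, and arbitrary parameter values; the function names are illustrative. It also checks the one case with a closed form: when κ = 0 the outcomes are independent and c(θ) equals the product of the terms 1 + exp(xi′β).

```python
import itertools
import numpy as np

def log_unnormalized(Y, X, A, beta, kappa):
    """Exponent of the centered pmf: Y'X beta - kappa Y'A zeta + (kappa/2) Y'AY."""
    eta = X @ beta
    zeta = 1.0 / (1.0 + np.exp(-eta))
    return Y @ eta - kappa * (Y @ A @ zeta) + 0.5 * kappa * (Y @ A @ Y)

def normalizing_function(X, A, beta, kappa):
    """c(theta) by enumerating all 2^n points of {0,1}^n; feasible only for tiny n."""
    n = X.shape[0]
    return sum(np.exp(log_unnormalized(np.array(Y, dtype=float), X, A, beta, kappa))
               for Y in itertools.product([0, 1], repeat=n))

# Hypothetical four-node cycle graph and design matrix.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.column_stack([np.ones(4), np.array([-1.0, 0.0, 0.5, 1.0])])
beta = np.array([0.2, -0.4])    # illustrative values

c_dep = normalizing_function(X, A, beta, kappa=0.5)
c_ind = normalizing_function(X, A, beta, kappa=0.0)
print(c_dep)
print(np.isclose(c_ind, np.prod(1.0 + np.exp(X @ beta))))   # closed form when kappa = 0
```

The sum has 16 terms here; at n = 50 it would already have more than 10¹⁵.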
There are many techniques for doing inference in the presence of intractable
normalizing functions (see, e.g., Park and Haran, 2017). One way is to avoid the
normalizing function altogether. For the autologistic model, this can be
accomplished by considering the so-called pseudolikelihood (PL), which is a
composite likelihood (Lindsay, 1988) of the conditional type. Each of the n factors
in the pseudolikelihood is the likelihood of a single observation, conditional on
said observation’s neighbors:
$$
p_i(\theta)^{z_i}\{1 - p_i(\theta)\}^{1 - z_i} = P(Z_i = z_i \mid \{Z_j : (i, j) \in E\}) = \frac{\exp[z_i\{x_i'\beta + \kappa a_i'(Z - \zeta)\}]}{1 + \exp\{x_i'\beta + \kappa a_i'(Z - \zeta)\}},
$$
where zi is the observed value of Zi, and ai′ is the ith row of A. Since the pi are
free of the normalizing function, so is the log pseudolikelihood, which is given by
$$
\ell_{pl}(\theta) = Z'\{X\beta + \kappa A(Z - \zeta)\} - \sum_i \log[1 + \exp\{x_i'\beta + \kappa a_i'(Z - \zeta)\}]. \qquad (8)
$$
Although (8) is not the true log likelihood unless κ = 0, Besag (1975) showed
that the maximum pseudolikelihood estimator (MPLE) converges almost surely to
the maximum likelihood estimator (MLE) as the lattice size goes to ∞ (under an
inﬁll, as opposed to increasing domain, regime). For small samples the MPLE is
less precise than the MLE (and the Bayes estimator), but point estimation of β
is generally so poor for small samples that precision is unimportant. When the
sample size is large enough to permit accurate estimation of β, the MPLE is nearly
as precise as the MLE (Hughes et al., 2011).
We find the MPLE θ̃ by maximizing ℓpl(θ). This is computationally efficient even
for larger samples. To speed computation even further, we can use a quasi-Newton
(Byrd et al., 1995) or conjugate-gradient algorithm and supply the score function
∇ℓpl(θ) = ((Z − p)′(I − κAD)X, (Z − p)′A(Z − ζ))′,
where p = (p1, . . . , pn)′ and D = diag{ζi(1 − ζi)}.
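The log pseudolikelihood and its score can be written directly from the formulas above. The sketch below assumes a hypothetical ring graph, an arbitrary covariate, and randomly generated binary outcomes, so it illustrates the computation rather than any analysis in this paper. It checks the analytic score against central finite differences and then passes both functions to SciPy's L-BFGS-B routine, which is used here as one example of a quasi-Newton method.

```python
import numpy as np
from scipy.optimize import minimize

def _pieces(theta, Z, X, A):
    beta, kappa = theta[:-1], theta[-1]
    zeta = 1.0 / (1.0 + np.exp(-(X @ beta)))      # independence expectations
    eta = X @ beta + kappa * (A @ (Z - zeta))     # centered linear predictor
    return kappa, zeta, eta

def log_pl(theta, Z, X, A):
    """Log pseudolikelihood of the centered autologistic model."""
    _, _, eta = _pieces(theta, Z, X, A)
    return Z @ eta - np.sum(np.logaddexp(0.0, eta))

def score_pl(theta, Z, X, A):
    """Score: ((Z - p)'(I - kappa A D) X, (Z - p)'A(Z - zeta))'."""
    kappa, zeta, eta = _pieces(theta, Z, X, A)
    p = 1.0 / (1.0 + np.exp(-eta))
    D = np.diag(zeta * (1.0 - zeta))
    d_beta = (Z - p) @ (np.eye(len(Z)) - kappa * A @ D) @ X
    return np.append(d_beta, (Z - p) @ A @ (Z - zeta))

# Hypothetical data: ring graph on 40 units, one covariate, random binary outcomes.
rng = np.random.default_rng(0)
n = 40
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
X = np.column_stack([np.ones(n), np.sin(2.0 * np.pi * np.arange(n) / n)])
Z = rng.integers(0, 2, n).astype(float)

theta0 = np.array([0.3, -0.5, 0.2])
eps = 1e-6
numeric = np.array([(log_pl(theta0 + eps * e, Z, X, A)
                     - log_pl(theta0 - eps * e, Z, X, A)) / (2.0 * eps)
                    for e in np.eye(3)])
analytic = score_pl(theta0, Z, X, A)
print(np.allclose(numeric, analytic, atol=1e-5))

fit = minimize(lambda t: -log_pl(t, Z, X, A), np.zeros(3),
               jac=lambda t: -score_pl(t, Z, X, A), method="L-BFGS-B")
print(fit.x)   # maximum pseudolikelihood estimate for these simulated data
```

Forming the n × n matrices explicitly keeps the code close to the formula; for large n one would exploit the sparsity of A and the diagonal structure of D instead.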
Conﬁdence intervals can be obtained using a parametric bootstrap (Efron and
Tibshirani, 1994) or sandwich estimation. For the former we generate b samples
from π(Z | θ̃) and compute the MPLE for each sample, thus obtaining the bootstrap
sample θ̃^(1), . . . , θ̃^(b). Appropriate quantiles of the bootstrap sample are then
used to construct approximate confidence intervals for the elements of θ.
The second approach for computing confidence intervals is based on (Varin et al.,
2011)
$$
\sqrt{n}(\tilde{\theta} - \theta) \Rightarrow N\{0, \mathcal{I}_{pl}^{-1}(\theta)\mathcal{J}_{pl}(\theta)\mathcal{I}_{pl}^{-1}(\theta)\}, \qquad (9)
$$
where ℐpl⁻¹(θ)𝒥pl(θ)ℐpl⁻¹(θ) is the inverse of the Godambe information matrix
(Godambe, 1960). The “bread” in this sandwich is the inverse of the information
matrix ℐpl(θ) = −E∇²ℓpl(θ), and the “filling” is the variance of the score:
𝒥pl(θ) = E∇∇′ℓpl(θ). We use the observed information (computed during
optimization) in place of ℐpl and estimate 𝒥pl using a parametric bootstrap. For the
bootstrap we simulate b samples Z^(1), . . . , Z^(b) from π(Z | θ̃) and estimate 𝒥pl as
$$
\hat{\mathcal{J}}_{pl}(\tilde{\theta}) = \frac{1}{b} \sum_{k=1}^{b} \nabla\nabla'\ell_{pl}(\tilde{\theta} \mid Z^{(k)}).
$$
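Once the observed information and the bootstrap scores are in hand, assembling the sandwich is a few lines of linear algebra. The sketch below assumes that row k of the array `scores` holds the score vector evaluated at the MPLE for bootstrap sample Z^(k), and that the information and the score variance are both totals over the n observations, in which case the product approximates the variance of the MPLE itself. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def sandwich_covariance(observed_info, scores):
    """Return I^{-1} J I^{-1}, with J the average outer product of bootstrap scores.

    observed_info : (d, d) negative Hessian of the log pseudolikelihood at the MPLE
    scores        : (b, d) array whose kth row is the score for bootstrap sample k
    """
    b = scores.shape[0]
    J_hat = scores.T @ scores / b           # (1/b) sum_k s_k s_k'
    bread = np.linalg.inv(observed_info)
    return bread @ J_hat @ bread

# Toy inputs, for illustration only.
rng = np.random.default_rng(2)
scores = rng.standard_normal((500, 3)) * np.array([2.0, 1.0, 0.5])
observed_info = np.diag([4.0, 2.0, 1.0])

cov = sandwich_covariance(observed_info, scores)
print(np.sqrt(np.diag(cov)))   # approximate standard errors
```

Because only b score evaluations are needed, rather than b full optimizations, this step is cheap relative to the full bootstrap.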

Because the bootstrap sample can be generated in parallel and little subsequent
processing is required, these approaches to inference are very eﬃcient computa-
tionally, even for large datasets. We note that sandwich estimation tends to be
much faster than the full bootstrap. Moreover, asymptotic inference and bootstrap
inference yield comparable results for practically all sample sizes because (9) is not,
in fact, an asymptotic result. This is because the log pseudolikelihood is approxi-
mately quadratic with Hessian approximately invariant in law, which implies that
the MPLE is approximately normally distributed irrespective of sample size (Geyer,
2013).
4.2. Computing for the Traditional SGLMM. The traditional SGLMM is
typically applied using MCMC for Bayesian inference, in which case the model for
ψ might be considered a prior distribution. Whether the model is viewed from a
Bayesian or a classical point of view, or is applied to areal data or point-level data,
the computational bottleneck is the handling of ψ’s precision matrix Σ⁻¹.
For point-level outcomes the customary approach to this problem is to avoid
inversion of Σ in favor of Cholesky decomposition followed by a linear solve. Since
Σ is typically dense, its Cholesky decomposition is in O(n³), and so the time
complexity of the overall fitting algorithm is in O(n³). This considerable computational
expense makes the analyses of large point-level datasets time consuming or infea-
sible. Consequently, eﬀorts to reduce the computational burden have resulted in
an extensive literature detailing many approaches, e.g., process convolution (Hig-
don, 2002), ﬁxed-rank kriging (Cressie and Johannesson, 2008), Gaussian predictive
process models (Banerjee et al., 2008), covariance tapering (Furrer et al., 2006), ap-
proximation by a Gaussian Markov random ﬁeld (Rue and Tjelmeland, 2002; Lind-
gren et al., 2011), integrated nested Laplace approximations (Rue et al., 2009), and
nearest-neighbor Gaussian process models (Datta et al., 2016).
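The Cholesky-then-solve strategy mentioned above can be sketched as follows for the log density of ψ. The locations, the exponential covariance (the Matérn with ν = 1/2), and the function name are assumptions made for the example. The result is compared with SciPy's multivariate normal log density as a sanity check.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def gaussian_logpdf_chol(psi, Sigma):
    """Log density of N(0, Sigma) at psi via Cholesky, without forming Sigma^{-1}."""
    c, lower = cho_factor(Sigma, lower=True)     # O(n^3) for dense Sigma
    alpha = cho_solve((c, lower), psi)           # solves Sigma alpha = psi
    logdet = 2.0 * np.sum(np.log(np.diag(c)))    # log |Sigma| from the factor
    n = len(psi)
    return -0.5 * (psi @ alpha + logdet + n * np.log(2.0 * np.pi))

# Hypothetical point-level setting: 50 random locations in the unit square.
rng = np.random.default_rng(3)
s = rng.uniform(size=(50, 2))
dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
Sigma = np.exp(-dist / 0.3)                      # assumed exponential covariance
psi = rng.standard_normal(50)

ours = gaussian_logpdf_chol(psi, Sigma)
reference = multivariate_normal.logpdf(psi, mean=np.zeros(50), cov=Sigma)
print(np.isclose(ours, reference))
```

The factorization supplies both the quadratic form and the log determinant, so Σ is factored once per evaluation and never inverted.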
Fitting the areal version of the model can also be burdensome even though the
areal model is parameterized in terms of Σ⁻¹ and Σ⁻¹ is sparse. It is well known
that a univariate Metropolis–Hastings algorithm for sampling from the posterior
distribution of ψ leads to a slowly mixing Markov chain because the components of
ψ exhibit strong a posteriori dependence. This has led to a number of methods for
updating the random effects in one or more blocks. Constructing proposals for these
block updates is challenging, and the improved mixing comes at the cost of increased
running time per iteration (see, for instance, Knorr-Held and Rue, 2002; Haran
et al., 2003; Haran and Tierney, 2010).
The large dimension of ψ and the slowness of mixing together imply a large
storage requirement too. If RAM capacity is insuﬃcient the samples can be stored
in a ﬁle-backed structure, but this solution is hardly ideal since accessing secondary
storage is many orders of magnitude slower than accessing RAM.
4.3. Computing for the RSR Model. Restricted spatial regression can be done
parsimoniously and efficiently by augmenting Xβ with an appropriate basis
expansion. Hughes and Haran (2013) employed the linear predictor Xβ + Mη in their
sparse areal mixed model (SAMM), where M is n × q and its columns are the
q principal eigenvectors of the Moran basis; η ∼ N{0, (τM′QM)⁻¹} are spatial
random effects; and Q = diag(A1) − A is the Laplacian (Brouwer and Haemers,
2012) of G. The Moran basis takes its name from the Moran operator for X:
Mx = (I − Px)A(I − Px). This operator appears in a generalized form of Moran’s
I (a popular nonparametric measure of spatial dependence for areal data (Moran,
1950)), which is given by
$$
I_x(v) = \frac{n}{1'A1} \, \frac{v'(I - P_x)A(I - P_x)v}{v'(I - P_x)(I - P_x)v}.
$$
(This becomes Moran’s I when Px is replaced with n⁻¹11′, i.e., when X = 1.)
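The generalized Moran's I is simple to compute from its definition. The sketch below assumes a hypothetical path graph on six units and takes X to be a column of ones, so that the statistic reduces to the classical Moran's I; the function name `moran_I_x` and the two test patterns are illustrative.

```python
import numpy as np

def moran_I_x(v, X, A):
    """Generalized Moran's I: (n / 1'A1) * r'Ar / r'r, where r = (I - P_x) v."""
    n = len(v)
    P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto C(X)
    r = v - P @ v                           # residual of v after projecting out X
    return n / A.sum() * (r @ A @ r) / (r @ r)

# Hypothetical path graph on six areal units.
n = 6
A_path = np.zeros((n, n))
for i in range(n - 1):
    A_path[i, i + 1] = A_path[i + 1, i] = 1.0
ones = np.ones((n, 1))                      # X = 1 gives the classical Moran's I

smooth = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
print(moran_I_x(smooth, ones, A_path))        # positive: attractive pattern
print(moran_I_x(alternating, ones, A_path))   # negative: repulsive pattern
```

A smoothly varying pattern yields a positive value and a checkerboard-like pattern a negative one, matching the interpretation of attraction and repulsion used throughout.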
Boots and Tiefelsdorf (2000) showed that (1) the (standardized) spectrum of
Mx comprises the possible values for Ix, and (2) the eigenvectors comprise all
possible mutually distinct patterns of clustering residual to C(X) and accounting
for G. The positive (negative) eigenvalues of Mx correspond to varying degrees
of positive (negative) spatial dependence, and the eigenvectors associated with a
given eigenvalue (ωi, say) are the patterns of spatial clustering that data exhibit
when the dependence among them is of degree ωi. In other words, the eigenvectors
of Mx form a multiresolutional spatial basis for C(X)⊥ that exhausts all possible
patterns that can arise on G. Three Moran basis vectors are shown in Figure 1.
Figure 1. Three Moran basis vectors, exhibiting spatial patterns
of increasingly ﬁner scale.
Since we do not expect to observe repulsion in the phenomena to which these
models are usually applied, we can use the spectrum of the operator to discard
all repulsive patterns, retaining only attractive patterns for our analysis (although
it can be advantageous to accommodate repulsion (Griﬃth, 2006)). By retaining
only eigenvectors that exhibit positive spatial dependence, we can usually reduce the
model dimension by at least half a priori. And Hughes and Haran (2013) showed
that a much greater reduction is possible in practice, with 50–100 eigenvectors
being suﬃcient for most datasets. Moreover, a simple spherical Gaussian proposal
distribution for η performs well because the elements of η are approximately a
posteriori uncorrelated owing to the orthogonality of the Moran basis.
Although using a truncated Moran basis dramatically reduces the time required
to draw samples from the posterior, and the space required to store those samples,
this approach does incur the substantial up-front burden of computing and eigen-
decomposing Mx. The eﬃciency of the former can be increased by storing A in a
sparse format (Furrer and Sain, 2010) and parallelizing the matrix multiplications.
And we can more eﬃciently obtain the desired basis vectors by computing only the
ﬁrst q eigenvectors of Mx instead of doing the full eigendecomposition. This can
be done using the Spectra library (Qiu, 2017), for example.
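The partial eigendecomposition described above can be sketched in Python. This is only an illustration under stated assumptions: it uses SciPy's ARPACK wrapper (`scipy.sparse.linalg.eigsh`) rather than the Spectra library, applies $M_x$ as a linear operator so that the dense n × n matrix is never formed, and the helper names `lattice_adjacency` and `moran_basis` are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh


def lattice_adjacency(nrow, ncol):
    """Sparse rook-adjacency matrix A for an nrow x ncol lattice (illustrative helper)."""
    def path(m):
        off = np.ones(m - 1)
        return sp.diags([off, off], [-1, 1])
    return (sp.kron(sp.identity(nrow), path(ncol))
            + sp.kron(path(nrow), sp.identity(ncol))).tocsr()


def moran_basis(A, X, q):
    """Return the q largest eigenvalues of M_x = (I - P_x) A (I - P_x) and their eigenvectors."""
    n = A.shape[0]
    Qx, _ = np.linalg.qr(X)            # orthonormal basis for C(X), so P_x = Qx Qx'

    def project(v):                     # v -> (I - P_x) v
        return v - Qx @ (Qx.T @ v)

    def matvec(v):                      # v -> M_x v without forming M_x
        return project(A @ project(v))

    Mx = LinearOperator((n, n), matvec=matvec, dtype=float)
    vals, vecs = eigsh(Mx, k=q, which="LA")   # q algebraically largest eigenvalues
    order = np.argsort(vals)[::-1]             # report them in decreasing order
    return vals[order], vecs[:, order]
```

Because `which="LA"` requests only the largest eigenvalues, the returned vectors correspond to attractive patterns, provided q does not exceed the number of positive eigenvalues.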

We note that Guan and Haran (2016) recently developed an approach to RSR
for point-level data. Their approach is based on random projections (Sarlos, 2006;
Halko et al., 2011; Banerjee et al., 2013).
4.4. Computing for the SCRM. The hierarchical copula model and the direct
copula model pose rather diﬀerent computing challenges. And that is not the only
important diﬀerence between the two models. A suﬃciently substantive discussion
of this issue is beyond the scope of this article, but it is worth mentioning that
the hierarchical SCRM may be more appealing from a modeling point of view
(Musgrove et al., 2016) but suﬀers from certain limitations when employed as a
data-analytic tool (Han and De Oliveira, 2016). For this reason we will focus on
copCAR (Hughes, 2015), a form of the direct copula model, here and in Section 5.
copCAR employs the CAR copula, a Gaussian copula (or other suitable copula)
based on the proper CAR described above. Recall that the proper CAR has pre-
cision matrix τQ, where Q = diag(A1) −ρA. Since a copula is scale free, we do
not need τ, but omitting τ does not leave us with an inverse correlation matrix
because the variances $\sigma^2 = (\sigma_1^2, \dots, \sigma_n^2)' = \mathrm{vecdiag}(Q^{-1})$ are not equal to 1. We
could rescale Q so that its inverse is a correlation matrix, i.e., we could construct
a Gaussian copula using $\Lambda^{1/2} Q \Lambda^{1/2}$, where $\Lambda = \mathrm{diag}(\sigma^2)$. In fact, rescaling is
necessary in the general case lest the model be unidentiﬁable with respect to the
variances. For copCAR, however, rescaling is unnecessary because the variances σ2
are not free parameters; the variances are entirely determined by Q’s only depen-
dence parameter, ρ, which is not a scale parameter. Since using Q itself leads to an
identiﬁable model, rescaling would merely slow computation. Thus copCAR em-
ploys the CAR correlation structure indirectly, by using Q along with the variances
σ2. This leads to the CAR copula:
$$\Phi_{0,Q^{-1}}\{\Phi_{\sigma_1}^{-1}(u_1), \dots, \Phi_{\sigma_n}^{-1}(u_n)\}, \tag{10}$$
where $\Phi_{\sigma_i}$ denotes the distribution function of the normal distribution with mean
0 and variance $\sigma_i^2$.
The model specification can be completed by pairing the CAR copula with
a set of suitable marginal distributions for the outcomes. The copula and the
marginals are linked by way of the probability integral transform. Specifically, if
$Z = (Z_1, \dots, Z_n)'$ are the observations, and $F_1, \dots, F_n$ are the desired marginal
distribution functions, we have $Z_i = F_i^{-1}(U_i)$, where $U = (U_1, \dots, U_n)'$ is a realization
of the copula. We will assume Bernoulli marginal distributions with expectations
$\{1 + \exp(-x_i'\beta)\}^{-1}$.
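The construction above can be illustrated by simulating one binary field. The following is a minimal sketch, assuming a dense adjacency matrix A and a value of ρ in [0, 1); the function name `simulate_copcar_bernoulli` is hypothetical and this is not the copCAR package's implementation.

```python
import numpy as np
from scipy.stats import norm


def simulate_copcar_bernoulli(A, X, beta, rho, rng):
    """Draw one binary field from a CAR copula with Bernoulli marginals."""
    n = A.shape[0]
    Q = np.diag(A.sum(axis=1)) - rho * A            # proper CAR precision with tau omitted
    sigma = np.sqrt(np.diag(np.linalg.inv(Q)))      # sigma_i, from vecdiag(Q^{-1})
    L = np.linalg.cholesky(Q)                       # Q = L L'
    Y = np.linalg.solve(L.T, rng.standard_normal(n))  # Y ~ N(0, Q^{-1})
    U = norm.cdf(Y / sigma)                         # U_i = Phi_{sigma_i}(Y_i), uniform on (0, 1)
    p = 1.0 / (1.0 + np.exp(-X @ beta))             # Bernoulli expectations
    Z = (U > 1.0 - p).astype(int)                   # Z_i = F_i^{-1}(U_i) for a Bernoulli(p_i) marginal
    return Z, U
```

The last step uses the Bernoulli quantile function, which returns 1 exactly when $U_i$ exceeds $1 - p_i$, so each $Z_i$ has the intended marginal expectation while the dependence comes entirely from the copula.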
Unless n is quite small, computation of the copCAR likelihood is infeasible (when
the marginals are discrete) because the multinormal cdf is unstable in high
dimensions and because the likelihood contains a sum of $2^n$ terms. For Bernoulli
marginals, a composite marginal likelihood approach (Varin, 2008) performs well.
The objective function is a product of pairwise likelihoods:
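$$L_{\mathrm{cml}}(\theta \mid Z) = \prod_{\substack{i,j \in \{1,\dots,n\} \\ i \neq j}} \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^k H_{ij}(U_{ij_1}, U_{jj_2}),$$
where $H_{ij}$ denotes the bivariate Gaussian copula with covariance matrix
$$V_{ij} = \begin{pmatrix} \sigma_i^2 & (Q^{-1})_{ij} \\ (Q^{-1})_{ij} & \sigma_j^2 \end{pmatrix}.$$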

This implies the log composite likelihood
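$$\ell_{\mathrm{cml}}(\theta \mid Z) = \sum_{\substack{i \in \{1,\dots,n-1\} \\ j \in \{i+1,\dots,n\}}} \log\left\{ \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^k \Phi_{0,V_{ij}}(Y_{ij_1}, Y_{jj_2}) \right\}, \tag{11}$$
where $Y_{\bullet 0} = \Phi_{\sigma_\bullet}^{-1}\{F_\bullet(Z_\bullet)\}$ and $Y_{\bullet 1} = \Phi_{\sigma_\bullet}^{-1}\{F_\bullet(Z_\bullet - 1)\}$. Optimization of (11) yields
$\hat\theta_{\mathrm{cml}}$.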
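A single pairwise term can be made concrete for Bernoulli marginals. The sketch below assumes that the sign exponent is k = j1 + j2 (the usual inclusion-exclusion sign for a rectangle probability). For binary outcomes every term with an infinite argument collapses to 0, 1, or a univariate cdf, so only one genuinely bivariate cdf evaluation is needed per pair. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal


def pair_probabilities(p_i, p_j, V):
    """Joint pmf of (Z_i, Z_j) under a bivariate Gaussian copula with Bernoulli marginals.

    V is the 2 x 2 covariance matrix V_ij; the result is indexed by [z_i, z_j].
    """
    s_i, s_j = np.sqrt(V[0, 0]), np.sqrt(V[1, 1])
    # Finite thresholds Phi_sigma^{-1}{F(0)}, where F(0) = 1 - p for a Bernoulli(p) marginal.
    a_i = s_i * norm.ppf(1.0 - p_i)
    a_j = s_j * norm.ppf(1.0 - p_j)
    both0 = multivariate_normal.cdf([a_i, a_j], mean=[0.0, 0.0], cov=V)
    P = np.empty((2, 2))
    P[0, 0] = both0
    P[0, 1] = (1.0 - p_i) - both0
    P[1, 0] = (1.0 - p_j) - both0
    P[1, 1] = 1.0 - (1.0 - p_i) - (1.0 - p_j) + both0
    return P


def log_pairwise_term(z_i, z_j, p_i, p_j, V):
    """Contribution of one pair of observations to the log composite likelihood."""
    return np.log(pair_probabilities(p_i, p_j, V)[z_i, z_j])
```

Summing `log_pairwise_term` over the chosen pairs gives the objective, with $p_i$ computed from β and $V_{ij}$ computed from ρ.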
While $\hat\beta_{\mathrm{cml}}$ tends to be approximately normally distributed, $\hat\rho_{\mathrm{cml}}$ tends to be left
skewed when ρ is close to 1. This implies that asymptotic inference for ρ tends to
result in poor coverage rates. This can be avoided by using a parametric bootstrap,
but a parametric bootstrap is rather burdensome computationally. Luckily, a simple
reparameterization yields an approximately normally distributed estimator because
the objective function for the reparameterized model is approximately quadratic
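with constant Hessian (Geyer, 2013). Specifically, for $\theta = (\beta', \Phi^{-1}(\rho))'$, we have
$$\sqrt{n}(\hat\theta_{\mathrm{cml}} - \theta) \Rightarrow N\{0, \mathcal{I}_{\mathrm{cml}}^{-1}(\theta)\, \mathcal{J}_{\mathrm{cml}}(\theta)\, \mathcal{I}_{\mathrm{cml}}^{-1}(\theta)\},$$
where $\mathcal{I}_{\mathrm{cml}}$ is the Fisher information matrix and $\mathcal{J}_{\mathrm{cml}}$ is the variance of the score:
$$\mathcal{J}_{\mathrm{cml}}(\theta) = \mathbb{V}\,\nabla \ell_{\mathrm{cml}}(\theta \mid Z).$$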
Note that the asymptotic covariance matrix for the CML estimator is the inverse of a
Godambe information matrix (Godambe, 1960), i.e., a sandwich matrix, because ℓcml is misspecified. The matrix can
be estimated in the same manner as we described above for the autologistic model.
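Once estimates of the two information matrices are available, the sandwich form and the back-transformation for ρ take only a few lines. This sketch assumes that `I_hat` and `J_hat` are per-observation estimates (so that dividing by n gives the covariance of the estimator) and that the last element of θ is Φ⁻¹(ρ). The function names are hypothetical, and the estimation of the two matrices is not shown.

```python
import numpy as np
from scipy.stats import norm


def sandwich_covariance(I_hat, J_hat, n):
    """Estimated covariance of theta_hat: I^{-1} J I^{-1} / n."""
    I_inv = np.linalg.inv(I_hat)
    return I_inv @ J_hat @ I_inv / n


def rho_interval(gamma_hat, se_gamma, level=0.95):
    """Wald interval for gamma = Phi^{-1}(rho), mapped back to the rho scale."""
    z = norm.ppf(0.5 + level / 2.0)
    return norm.cdf(gamma_hat - z * se_gamma), norm.cdf(gamma_hat + z * se_gamma)
```

Because Φ is monotone, the transformed interval has the same coverage as the interval on the Φ⁻¹(ρ) scale and always lies inside (0, 1).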
The form of ℓcml given in (11) requires four evaluations of the bivariate normal
cdf for each of the n(n −1)/2 pairs of observations. This computation is rather
expensive even for fairly small samples.
In a spatial setting we can expect a pair of nearby observations to carry more
information about dependence than a pair of more distant observations. Others
have found, in a variety of contexts, that retaining the contributions to the CML
made by more distant pairs of observations decreases not only the computational
eﬃciency of the procedure but also the statistical eﬃciency of the estimator (Varin
and Vidoni, 2009; Apanasovich et al., 2008). Hence, we allow only pairs of adjacent
observations to contribute to the copCAR CML. This means replacing (11) with
$$\ell_{\mathrm{cml}}(\theta \mid Z) = \sum_{\substack{i,j:\,(i,j) \in E \\ i < j}} \log\left\{ \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^k \Phi_{0,V_{ij}}(Y_{ij_1}, Y_{jj_2}) \right\}. \tag{12}$$
If thoughtfully implemented, optimization of (12) is eﬃcient enough to permit anal-
ysis of larger areal datasets.
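The saving from restricting the sum to adjacent pairs is easy to quantify. The sketch below counts pairs on a rectangular lattice, assuming rook (four-neighbor) adjacency, which is an assumption made here for illustration; `pair_counts` is a hypothetical helper.

```python
def pair_counts(nrow, ncol):
    """Number of all pairs and of rook-adjacent pairs on an nrow x ncol lattice."""
    n = nrow * ncol
    all_pairs = n * (n - 1) // 2
    adjacent_pairs = nrow * (ncol - 1) + ncol * (nrow - 1)
    return all_pairs, adjacent_pairs


all_pairs, adjacent_pairs = pair_counts(30, 30)
print(4 * all_pairs)        # bivariate cdf evaluations when every pair contributes
print(4 * adjacent_pairs)   # bivariate cdf evaluations when only adjacent pairs contribute
```

For a 30 × 30 lattice this is 404,550 pairs against 1,740 adjacent pairs, so the number of bivariate cdf evaluations per objective evaluation falls from about 1.6 million to 6,960.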
5. Application of Various Spatial Regression Models to Simulated
Binary Data
Our simulation study focused on binary areal outcomes, for the reasons given
above.
We simulated those outcomes on the 30 × 30 square lattice.
This data
size kept the computational burden manageable while giving all of the approaches
a ﬁghting chance at performing well. Our mean surface was a function of the x
and y coordinates of the lattice points, x = (x1, . . . , xn)′ and y = (y1, . . . , yn)′,
respectively, which we restricted to the unit square centered at the origin. While
simulating data we used linear predictor β0 + β1x1 + β2x2, where x1 = x and
x2 = x + y + 3s. Vector s exhibits a smaller-scale spatial pattern than do x and y;
this lends more interesting spatial structure to the mean surface and ensures that
x1 and x2 are substantially, but not strongly, correlated (cor(x1, x2) = 0.45 rather
than 0.71). We let β = (0.2, 1, 1)′, which implies a mean vector equal to
$$p = \{1 + \exp(-0.2 - x_1 - x_2)\}^{-1} = \{1 + \exp(-0.2 - x - x - y - 3s)\}^{-1}.$$
These means are shown in Figure 2.
Figure 2. The mean surface for the simulation study. [Axes: x and y.]
Predictor x2 was our unmeasured confounder and source of extra-Xβ spatial
variation. That is, we analyzed the data using X = (1 x1), which implies that
X = x2. More speciﬁcally, to each of 100 simulated datasets we applied six models:
(1) the ordinary logistic regression model with linear predictor β0 + β1x1;
(2) the centered autologistic model having regression component β0 + β1x1;
(3) the copCAR model with Ber[{1 + exp(−β0 −β1x1)}−1] marginals;
(4) the traditional CAR model having regression component β0 + β1x1;
(5) the sparse RSR model of Hughes and Haran, having regression component
β0+β1x1 and using the ﬁrst q = 100 eigenvectors of Mx, where X = (1 x1);
and
(6) the sparse RSR model of Hughes and Haran along with the posterior pre-
dictive approach of Hanks et al.
The results are provided in Table 1. We see that the RSR approach of Hughes and
Haran performed better than the other approaches. The RSR estimator of β1 has
the smallest bias and mean squared error, and strikes the best balance between
coverage rate and type II error rate. The RSR model also oﬀers the most accurate
prediction. The traditional CAR model, along with the RSR approach of Hanks
et al., resulted in very high coverage rates at the cost of very high type II error
rates. The other three models performed poorly with respect to coverage rate and
prediction.

Model                    Med. Est.   Med. CI   MSE    Coverage Rate −      Med. ∥p̂ − p∥
                         of β1 = 1   Width            Type II Rate
Ordinary Logistic        2.11        0.97      1.29     0% −  0% =  0      4.93
Centered Autologistic    2.17        1.17      1.44     0% −  0% =  0      4.18
copCAR                   2.15        1.26      1.36     0% −  0% =  0      4.93
Traditional CAR          2.35        5.27      2.59    99% − 61% = 38      3.21
RSR (q = 100)            2.01        2.30      1.18    56% −  2% = 54      3.01
Adjusted RSR (q = 100)   2.01        5.75      3.51   100% − 91% =  9      3.01
Table 1. Various performance measures for the first simulation
study: median estimate of β1 = 1, median 95% confidence/credible
region width, mean squared error, coverage rate minus type II error
rate, and median prediction error.
Predictions for a single dataset are shown in Figure 3. The autologistic model
and the CAR model clearly undersmooth. The CAR model's undersmoothing is
less dramatic, but it is perhaps surprising that the CAR model undersmooths at
all given that it has n spatial random effects. (Note that we could force $\hat\psi$ to be
smoother by using $Q^k$ ($k \geq 2$) in place of Q (Rue and Held, 2005).)
6. Bayesian Spatial Filtering
In this section we will develop, and assess the performance of, Bayesian spatial
ﬁltering, which possesses the computational advantages and good predictive per-
formance of RSR while allowing for some advantages in regression inference. We
begin by describing classical spatial ﬁltering.
6.1. Classical Spatial Filtering. In developing the SAMM, Hughes and Haran
(2013) drew inspiration from spatial ﬁltering (Griﬃth, 2003), which uses a basis
expansion to accommodate any extra-Xβ spatial pattern exhibited by the response
vector, resulting in conventional residuals, i.e., residuals having at most trace spatial
dependence. This implies that spatial ﬁltering can reveal extra-Xβ structure while
permitting the analyst to apply ordinary, well-understood diagnostic techniques to
the residuals.
The basis vectors used most often in spatial filtering are eigenvectors of the Moran
operator for 1: $M_1 = (I - n^{-1}\mathbf{1}\mathbf{1}')A(I - n^{-1}\mathbf{1}\mathbf{1}')$. This yields vectors that reside
in C(1)⊥. (Recall that the SAMM employs basis vectors from C(X)⊥, where X
typically contains 1 along with one or more spatially structured predictors.) Con-
siderable dimension reduction can be achieved by using only q ≪n basis vectors.
If we store said vectors as the columns of matrix $F_{n \times q}$, say, the filtering linear
predictor can be written as
g(µ) = Xβ + Fη,
where η is once again a q-vector of coeﬃcients.
Since constructing F requires that M1 be computed and eigendecomposed, it
is clear that spatial ﬁltering and the SAMM have much in common from a com-
putational point of view. There are two key diﬀerences, however. First, to our
knowledge, there are no Bayesian approaches for spatial filtering; practitioners
estimate η by optimizing a likelihood or a composite likelihood. The choice of objective
function of course has a substantial impact on computational complexity. And
second, the columns of F can be chosen using any of a number of methods (three will
be discussed shortly). Those methods vary greatly in sophistication and
computational complexity. It is not clear how they compare to one another with respect to
quality of regression inference or quality of prediction, however.
Figure 3. Predictions for a single simulated dataset. The truth
is shown in the upper left panel. [Panels: p; p̂ RSR (q = 100); p̂ CAR; p̂ Autologistic. Axes: x and y.]
Chun et al. (2016) recommend that the first
$$q_0 = \frac{n_+}{1 + \exp[2.148 - \{6.1808\,(z_{\mathrm{mi}} + 0.6)^{0.1742}\}/n_+^{0.1298} + 3.3534/(z_{\mathrm{mi}} + 0.6)^{0.1742}]}$$
eigenvectors be included initially in a stepwise, ordinary GLM analysis with a
significance level of 0.2. Here, $n_+$ is the number of positive eigenvalues of $M_1$ and $z_{\mathrm{mi}}$
is the z score of Moran's I for the response.
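The formula for q0 is straightforward to evaluate. The following sketch returns the unrounded value, since how to round is not specified here, and it assumes that the z score exceeds −0.6 so that the fractional power is defined. The function name is hypothetical.

```python
import math


def initial_candidate_count(n_plus, z_mi):
    """Unrounded q0: number of leading eigenvectors offered to the stepwise procedure."""
    a = (z_mi + 0.6) ** 0.1742            # requires z_mi > -0.6
    exponent = 2.148 - 6.1808 * a / n_plus ** 0.1298 + 3.3534 / a
    return n_plus / (1.0 + math.exp(exponent))
```

For fixed n+, the exponent decreases as the z score grows, so stronger evidence of spatial dependence admits a larger candidate set, and q0 can never exceed n+.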
For a binary response, another possibility is to do a two-sample t test for each
of the ﬁrst few hundred eigenvectors, where the eigenvector of interest is treated as
the response and Z is used as the grouping variable. Any eigenvector that yields a
p-value smaller than 0.1, say, is then included in the analysis. In this scheme, the
number of variables may or may not be further reduced using a stepwise procedure.
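This screening step can be vectorized over eigenvectors. The sketch below uses `scipy.stats.ttest_ind` with its default pooled-variance test; whether a pooled or a Welch test is intended is not stated above, so that choice is an assumption, as is the function name `screen_eigenvectors`.

```python
import numpy as np
from scipy.stats import ttest_ind


def screen_eigenvectors(F, z, alpha=0.1):
    """Return indices of columns of F retained by the t-test screen, plus all p-values."""
    z = np.asarray(z).astype(bool)
    # One two-sample t test per column: the eigenvector is the response, z the grouping variable.
    _, pvals = ttest_ind(F[z], F[~z], axis=0)
    return np.flatnonzero(pvals < alpha), pvals
```

Here F holds the candidate eigenvectors as columns and z is the binary response; the retained columns can then be passed to a stepwise procedure if further reduction is wanted.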

A third approach to spatial ﬁltering is to include Moran eigenvectors in a spatial
model and use that model’s dependence component to decide which eigenvectors
to retain. For example, one might use some procedure to choose the Moran
eigenvectors that lead to $\hat\kappa \approx 0$ for an appropriate automodel, or $\hat\rho \approx 0$ for a model that
employs the proper CAR. It is this technique for which spatial filtering is named,
since here an explicit aim is to remove (ﬁlter) spatial dependence from the response
(Griﬃth, 2004).
6.2. A Bayesian Approach to Spatial Filtering. We can develop a Bayesian
approach to spatial ﬁltering by replacing M in the SAMM speciﬁcation with a
ﬁltering design matrix F. Speciﬁcally, the Bayesian spatial ﬁltering (BSF) model
has the same transformed conditional mean as the classical spatial ﬁltering model,
namely,
g(µ) = Xβ + Fη,
where $F_{n \times q}$ contains the q principal eigenvectors of $M_1$, and η is a q-vector of
coeﬃcients. Borrowing from the SAMM, the prior distribution for η is
η ∼N{0, (τF′QF)−1},
(13)
where Q is once again the Laplacian of G. The BSF model, like the SAMM, assigns
β a spherical Gaussian prior with a large variance, and assigns the smoothing
parameter τ a gamma prior with shape parameter 0.5 and scale parameter 2,000. Note that the
latter prior, having a large mean, discourages artifactual spatial structure in the
posterior (Kelsall and Wakefield, 1999).
Since η are regression coeﬃcients, one may be tempted to assign η a spherical
Gaussian prior instead of the above mentioned prior. This would be a mistake,
however, for (13) is not arbitrary (see Reich et al. (2006) and/or Hughes and Ha-
ran (2013) for derivations) but is, in fact, very well suited to the task at hand.
Speciﬁcally, two characteristics of (13)—along with the above mentioned prior for
τ—discourage overﬁtting even when q is too large for the dataset being analyzed.
First, the prior variances are commensurate with the spatial scales of the predic-
tors in F (Figure 4). This shrinks toward zero the coeﬃcients corresponding to
predictors that exhibit small-scale spatial variation. Additionally, the correlation
structure of (13) eﬀectively reduces the degrees of freedom in the smoothing com-
ponent of the model.
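The relationship between spatial scale and prior variance can be checked numerically. This is a minimal sketch on a small square lattice with rook adjacency, using a dense eigendecomposition for simplicity; the lattice, the adjacency structure, and the function names are illustrative assumptions rather than the settings of the simulation study.

```python
import numpy as np


def lattice_adjacency(nrow, ncol):
    """Dense rook-adjacency matrix for an nrow x ncol lattice (illustrative helper)."""
    def path(m):
        return np.eye(m, k=1) + np.eye(m, k=-1)
    return np.kron(np.eye(nrow), path(ncol)) + np.kron(path(nrow), np.eye(ncol))


def bsf_prior_variances(A, q, tau=1.0):
    """Eigenvalues of M_1 for the q principal eigenvectors and the prior variances from (13)."""
    n = A.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n            # I - n^{-1} 1 1'
    vals, vecs = np.linalg.eigh(C @ A @ C)         # Moran operator for 1, ascending eigenvalues
    order = np.argsort(vals)[::-1][:q]
    F = vecs[:, order]                             # q principal eigenvectors
    Q = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
    prior_cov = np.linalg.inv(tau * F.T @ Q @ F)   # covariance matrix in (13)
    return vals[order], np.diag(prior_cov)


eigenvalues, variances = bsf_prior_variances(lattice_adjacency(10, 10), q=20)
print(np.log(variances))   # log prior variances, ordered from coarsest to finest pattern
```

The matrix F′QF is invertible because the retained eigenvectors are orthogonal to 1, which spans the null space of Q for a connected graph.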
If the response is non-Gaussian, β and η are updated using Metropolis–Hastings
random walks with Gaussian proposals.
The proposal covariance matrix for β
is the estimated asymptotic covariance matrix from an ordinary GLM ﬁt to the
data, which generally yields an acceptance rate around 50%. The proposal for η is
spherical Gaussian—with standard deviation ση, say. A sensible default value for
ση is 0.1, but a smaller value may be required to achieve large enough acceptance
rates for larger datasets. The update for τ is a Gibbs update irrespective of the
response distribution. If the response is Gaussian distributed, all updates are Gibbs
updates. Note that the BSF MCMC sampler is very easy to tune since ση is the
only tuning parameter (unless the outcomes are Gaussian, in which case no tuning
is required).
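The random-walk update for η can be sketched for a binary response with logit link. The code below assumes that β and τ are held at their current values, with `offset` standing for Xβ and `FQF` for the matrix F′QF; the function names are hypothetical and the updates for β and τ are not shown.

```python
import numpy as np


def log_posterior_eta(eta, z, offset, F, FQF, tau):
    """Log full conditional of eta up to a constant: Bernoulli log likelihood plus log prior (13)."""
    lin = offset + F @ eta
    loglik = np.sum(z * lin - np.logaddexp(0.0, lin))
    logprior = -0.5 * tau * (eta @ FQF @ eta)
    return loglik + logprior


def rw_update_eta(eta, z, offset, F, FQF, tau, sigma_eta, rng):
    """One Metropolis-Hastings update of eta with a spherical Gaussian random-walk proposal."""
    proposal = eta + sigma_eta * rng.standard_normal(eta.shape)
    log_ratio = (log_posterior_eta(proposal, z, offset, F, FQF, tau)
                 - log_posterior_eta(eta, z, offset, F, FQF, tau))
    if np.log(rng.uniform()) < log_ratio:   # the proposal is symmetric, so no Hastings correction
        return proposal, True
    return eta, False
```

Tracking the proportion of accepted proposals over many calls gives the acceptance rate that guides the choice of ση.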
Bayesian spatial ﬁltering for point-level outcomes can be accomplished analo-
gously by adapting Guan and Haran’s (2016) random-projection framework. The
resulting BSF model for point-level data is different from the areal BSF model in
a potentially important way, however. Since Guan and Haran obtain their basis
vectors by eigendecomposing Σ, they, quite naturally, assign a spherical Gaussian
prior to η. Because a spherical Gaussian prior lacks the appealing attributes of
(13), Guan and Haran recommend that q = rank(F) be chosen in a pre-processing
step. It is not clear how well this approach performs compared to the use of a prior
similar to (13).
Figure 4. Prior variances (on the log scale and for τ = 1) for
the elements of $\eta \sim N\{0, (\tau F'QF)^{-1}\}$ from the second simulation
study. The variances decrease rapidly as the spatial scale decreases,
which prevents overfitting. [Axes: Spatial Scale and Log of Prior Variance.]
6.3. Application of BSF to Simulated Binary Data. As a followup to the
simulation study described in Section 5, we applied our BSF model to the simulated
datasets. We used four diﬀerent values for q = rank(F): 50, 100, 200, and 400. The
results are given in Table 2.
For smaller values of q, $\hat\beta_{\mathrm{bsf}}$ has smaller bias and MSE than any of the other
estimators considered here, and yields a higher coverage rate while keeping the type
II rate very low. As q becomes large, the bias and MSE of $\hat\beta_{\mathrm{bsf}}$ grow. Eventually
the coverage rate begins to decrease, the type II rate to increase. The BSF model
also performs very well at prediction.
The BSF model accomplishes all of this through judicious use of multicollinearity.
Recall that the basis vectors used in RSR are (at least nearly) uncorrelated with the
columns of X. This is not the case for the BSF model since spatial ﬁltering employs
eigenvectors of M1. As we increase q, we introduce more and more multicollinearity
between X and F. Up to a point, this alleviates unmeasured confounding to some
extent (hence the reduced bias), and adjusts the variance upward by a modest
amount (hence the increased coverage rate). As q gets large, the linear predictor
becomes rather redundant, causing the BSF model to perform much like the CAR
model (increased bias and type II error rate).
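The contrast between the two bases can be seen directly by measuring how strongly a predictor correlates with each set of basis vectors. The sketch below uses a small lattice with rook adjacency and takes x to be the horizontal coordinate; these settings and the function names are illustrative assumptions, not those of the simulation study.

```python
import numpy as np


def lattice_adjacency(nrow, ncol):
    """Dense rook-adjacency matrix for an nrow x ncol lattice (illustrative helper)."""
    def path(m):
        return np.eye(m, k=1) + np.eye(m, k=-1)
    return np.kron(np.eye(nrow), path(ncol)) + np.kron(path(nrow), np.eye(ncol))


def principal_eigenvectors(A, X, q):
    """The q principal eigenvectors of the Moran operator (I - P_x) A (I - P_x)."""
    n = A.shape[0]
    R = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - P_x
    vals, vecs = np.linalg.eigh(R @ A @ R)
    return vecs[:, np.argsort(vals)[::-1][:q]]


def collinearity_check(nrow, ncol, q):
    """Largest absolute correlation between x and the RSR basis, and between x and the filtering basis."""
    n = nrow * ncol
    A = lattice_adjacency(nrow, ncol)
    x = np.tile(np.linspace(-0.5, 0.5, ncol), nrow)     # horizontal coordinate of each site
    ones = np.ones((n, 1))
    M = principal_eigenvectors(A, np.column_stack([ones, x]), q)  # basis from M_x (RSR)
    F = principal_eigenvectors(A, ones, q)                        # basis from M_1 (filtering)
    x_unit = (x - x.mean()) / np.linalg.norm(x - x.mean())
    return np.abs(x_unit @ M).max(), np.abs(x_unit @ F).max()


rsr_corr, bsf_corr = collinearity_check(12, 12, q=10)
print(rsr_corr, bsf_corr)
```

The first value is zero up to rounding error, because the eigenvectors of $M_x$ lie in C(X)⊥, whereas the second is clearly nonzero, which is the multicollinearity that the BSF model exploits.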

Model                    Med. Est.   Med. CI   MSE    Coverage Rate −      Med. ∥p̂ − p∥
                         of β1 = 1   Width            Type II Rate
Ordinary Logistic        2.11        0.97      1.29     0% −  0% =  0      4.93
Centered Autologistic    2.17        1.17      1.44     0% −  0% =  0      4.18
copCAR                   2.15        1.26      1.36     0% −  0% =  0      4.93
Traditional CAR          2.35        5.27      2.59    99% − 61% = 38      3.21
RSR (q = 100)            2.01        2.30      1.18    56% −  2% = 54      3.01
Adjusted RSR (q = 100)   2.01        5.75      3.51   100% − 91% =  9      3.01
BSF (q = 50)             1.87        2.08      0.94    63% −  2% = 61      3.07
BSF (q = 100)            1.89        2.34      0.96    74% −  2% = 72      3.01
BSF (q = 200)            1.99        2.69      1.25    81% −  2% = 79      2.99
BSF (q = 400)            2.25        3.23      1.88    67% − 20% = 47      3.01
Table 2. The performance of Bayesian spatial filtering.
7. Conclusion
When unmeasured confounding is the source of extra-Xβ spatial variation in a
response variable, spatial regression models struggle to perform well. In Section 5
we saw that the autologistic model and the copCAR model (examples of the auto-
model and the spatial copula regression model, respectively) perform rather poorly,
about as poorly as a non-spatial model.
Spatial mixed-eﬀects models perform
better but still have weaknesses. The traditional SGLMM, for example, is badly
spatially confounded and computationally burdensome, and often undersmooths.
Restricted spatial regression offers an appealing alternative, for RSR reduces bias
and mean squared error, provides a more sensible balance between coverage rate
and type II rate, smooths very effectively, and permits efficient computation. Yet
there is room for improvement.
In the latter part of this article we developed Bayesian spatial ﬁltering, which
performs as well as RSR with respect to prediction and computational complexity
while besting RSR in terms of bias, mean squared error, and coverage rate. BSF
does this by using an expansion in a well-chosen basis to introduce an appropriate
amount of spatial confounding. This situates the BSF model on a continuum of
spatial confounding, with the non-spatial model and the CAR model at either end:
$$(\text{Non-Spatial Model}) \;\overset{q \searrow 0}{\Longleftarrow}\; (\text{BSF Model}) \;\overset{q \nearrow n}{\Longrightarrow}\; (\text{CAR Model}).$$
References
Apanasovich, T., Ruppert, D., Lupton, J., Popovic, N., Turner, N., Chapkin, R.,
and Carroll, R. (2008).
Aberrant crypt foci and semiparametric modeling of
correlated binary data. Biometrics, 64(2):490–500.
Assunção, R. and Krainski, E. (2009). Neighborhood dependence in Bayesian
spatial models. Biometrical Journal, 51(5):851–869.
Banerjee, A., Dunson, D. B., and Tokdar, S. T. (2013). Eﬃcient Gaussian process
regression for large datasets. Biometrika, 100(1):75.
Banerjee, S., Carlin, B., and Gelfand, A. (2014). Hierarchical Modeling and Analysis
for Spatial Data. Chapman & Hall/CRC, Boca Raton.

Banerjee, S., Gelfand, A., Finley, A., and Sang, H. (2008). Gaussian predictive
process models for large spatial data sets. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 70(4):825–848.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems
(with discussion). Journal of the Royal Statistical Society, Series B: Methodolog-
ical, 36:192–236.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician: Journal
of the Institute of Statisticians, 24:179–196.
Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration, with two
applications in spatial statistics (Disc: P21-59). Annals of the Institute of Statistical
Mathematics, 43:1–20.
Boots, B. and Tiefelsdorf, M. (2000). Global and local spatial autocorrelation in
bounded regular tessellations. Journal of Geographical Systems, 2(4):319.
Brouwer, A. E. and Haemers, W. H. (2012). Spectra of Graphs. Springer-Verlag.
Byrd, R., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm
for bound constrained optimization.
SIAM Journal on Scientiﬁc Computing,
16(5):1190–1208.
Caragea, P. and Kaiser, M. (2009). Autologistic models with interpretable parame-
ters. Journal of Agricultural, Biological, and Environmental Statistics, 14(3):281–
300.
Chun, Y., Griﬃth, D. A., Lee, M., and Sinha, P. (2016). Eigenvector selection with
stepwise regression techniques to construct eigenvector spatial ﬁlters. Journal of
Geographical Systems, 18(1):67–85.
Clayton, D., Bernardinelli, L., and Montomoli, C. (1993). Spatial correlation in
ecological analysis. International Journal of Epidemiology, 22(6):1193–1202.
Cliﬀord, P. (1990). Markov random ﬁelds in statistics. In Grimmett, G. R. and
Welsh, D. J. A., editors, Disorder in Physical Systems: A Volume in Honour
of John M. Hammersley on His 70th Birthday, pages 19–32. Clarendon Press
[Oxford University Press].
Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial
data sets. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 70(1):209–226.
Cressie, N. A. (1993). Statistics for Spatial Data. John Wiley & Sons, New York,
2nd. edition.
Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical
nearest-neighbor Gaussian process models for large geostatistical datasets. Jour-
nal of the American Statistical Association, 111(514):800–812.
De Oliveira, V. (2000).
Bayesian prediction of clipped Gaussian random ﬁelds.
Computational Statistics & Data Analysis, 34(3):299–314.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics
(Disc: P326-350).
Journal of the Royal Statistical Society, Series C: Applied
Statistics, 47:299–326.
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC
Press.
Furrer, R., Genton, M., and Nychka, D. (2006). Covariance tapering for interpola-
tion of large spatial datasets. Journal of Computational and Graphical Statistics,
15(3):502–523.

Furrer, R. and Sain, S. R. (2010). spam: A sparse matrix R package with emphasis
on MCMC methods for Gaussian Markov random ﬁelds. Journal of Statistical
Software, 36(10):1–25.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin,
D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC, third edition.
Geyer, C. J. (2013). Le Cam made simple: Asymptotics of maximum likelihood
without the LLN or CLT or sample size going to inﬁnity. In Jones, G. L. and
Shen, X., editors, Advances in Modern Statistical Theory and Applications: A
Festschrift in honor of Morris L. Eaton. Institute of Mathematical Statistics,
Beachwood, Ohio, USA.
Godambe, V. (1960). An optimum property of regular maximum likelihood esti-
mation. The Annals of Mathematical Statistics, pages 1208–1211.
Griﬃth, D. (2004).
A spatial ﬁltering speciﬁcation for the autologistic model.
Environment and Planning A, 36(10):1791–1811.
Griﬃth, D. A. (2003). Spatial Autocorrelation and Spatial Filtering: Gaining Un-
derstanding Through Theory and Scientiﬁc Visualization. Springer, Berlin.
Griﬃth, D. A. (2006). Hidden negative spatial autocorrelation. Journal of Geo-
graphical Systems, 8(4):335–355.
Guan, Y. and Haran, M. (2016). A computationally eﬃcient projection-based ap-
proach for spatial generalized linear mixed models. ArXiv e-prints.
Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011).
Finding structure with
randomness: Probabilistic algorithms for constructing approximate matrix de-
compositions. SIAM Review, 53(2):217–288.
Han, Z. and De Oliveira, V. (2016).
On the correlation structure of Gaussian
copula models for geostatistical count data. Australian & New Zealand Journal
of Statistics.
Hanks, E. M., Schliep, E. M., Hooten, M. B., and Hoeting, J. A. (2015). Restricted
spatial regression in practice: Geostatistical models, confounding, and robustness
under model misspeciﬁcation. Environmetrics, 26(4):243–254.
Haran, M. (2011). Gaussian random ﬁeld models for spatial data. Handbook of
Markov Chain Monte Carlo, pages 449–478.
Haran, M., Hodges, J., and Carlin, B. (2003). Accelerating computation in Markov
random ﬁeld models for spatial data via structured MCMC. Journal of Compu-
tational and Graphical Statistics, 12(2):249–264.
Haran, M. and Tierney, L. (2010). On automating Markov chain Monte Carlo for
a class of spatial models. Bayesian Analysis, pages 1–26.
Higdon, D. (2002). Space and space-time modeling using process convolutions. In
Anderson, C., Barnett, V., Chatwin, P., and El-Shaarawi, A., editors, Quanti-
tative Methods for Current Environmental Issues, pages 37–56. Springer-Verlag,
London.
Hodges, J. S. and Reich, B. J. (2010). Adding spatially-correlated errors can mess
up the ﬁxed eﬀect you love. The American Statistician, 64(4):325–334.
Hughes, J. (2014). ngspatial: A package for ﬁtting the centered autologistic and
sparse spatial generalized linear mixed models for areal data. The R Journal,
6(2):81–95.
Hughes, J. (2015). copCAR: A ﬂexible regression model for areal data. Journal of
Computational and Graphical Statistics, 24(3):733–755.

Hughes, J. and Haran, M. (2013). Dimension reduction and alleviation of confound-
ing for spatial generalized linear mixed models. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 75(1):139–159.
Hughes, J., Haran, M., and Caragea, P. C. (2011). Autologistic models for binary
data on a lattice. Environmetrics, 22(7):857–871.
Joe, H. (2014). Dependence Modeling with Copulas. Chapman and Hall/CRC, Boca
Raton, USA.
Kaiser, M. S. and Cressie, N. (1997). Modeling Poisson variables with positive
spatial dependence. Statistics & Probability Letters, 35:423–432.
Kazianka, H. and Pilz, J. (2010). Copula-based geostatistical modeling of contin-
uous and discrete data including covariates. Stochastic Environmental Research
and Risk Assessment, 24(5):661–673.
Kelsall, J. and Wakeﬁeld, J. (1999). Discussion of “Bayesian models for spatially
correlated disease and exposure data”, by Best et al. In Bernardo, J., Berger,
J., Dawid, A., and Smith, A., editors, Bayesian Statistics 6. Oxford University
Press, New York.
Kindermann, R. and Snell, J. (1980). Markov Random Fields and Their Applica-
tions. American Mathematical Society, Providence, RI.
Knorr-Held, L. and Rue, H. (2002). On block updating in Markov random ﬁeld
models for disease mapping. Scandinavian Journal of Statistics, 29(4):597–614.
Kolev, N. and Paiva, D. (2009). Copula-based regression models: A survey. Journal
of Statistical Planning and Inference, 139(11):3847–3856.
Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian
fields and Gaussian Markov random fields: the stochastic partial differential
equation approach. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 73(4):423–498.
Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics,
80(1):221–239.
McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models. Chapman &
Hall Ltd.
Moran, P. (1950).
Notes on continuous stochastic phenomena.
Biometrika,
37(1/2):17–23.
Musgrove, D., Hughes, J., and Eberly, L. (2016). Hierarchical copula regression
models for areal data. Spatial Statistics, 17:38–49.
Paciorek, C. J. (2010). The importance of scale for spatial-confounding bias and
precision of spatial regression estimators. Statistical Science: A Review Journal
of the Institute of Mathematical Statistics, 25(1):107.
Park, J. and Haran, M. (2017). Bayesian inference in the presence of intractable
normalizing functions. ArXiv e-prints.
Qiu, Y. (2017). Spectra: Sparse eigenvalue computation toolkit as a redesigned
ARPACK.
Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning.
Springer.
Reich, B., Hodges, J., and Zadnik, V. (2006). Eﬀects of residual smoothing on the
posterior of the ﬁxed eﬀects in disease-mapping models. Biometrics, 62(4):1197–
1206.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Appli-
cations, volume 104 of Monographs on Statistics and Applied Probability. Chap-
man & Hall, London.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for
latent Gaussian models by using integrated nested Laplace approximations. Jour-
nal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–
392.
Rue, H. and Tjelmeland, H. (2002). Fitting Gaussian Markov random ﬁelds to
Gaussian ﬁelds. Scandinavian Journal of Statistics, 29(1):31–49.
Sarlos, T. (2006). Improved approximation algorithms for large matrices via random
projections. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual
IEEE Symposium on, pages 143–152. IEEE.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical
Analysis, 92(1):1–28.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood
methods. Statistica Sinica, 21(1):5–42.
Varin, C. and Vidoni, P. (2009).
Pairwise likelihood inference for general state
space models. Econometric Reviews, 28(1–3):170–185.
Wall, M. M. (2004). A close look at the spatial structure implied by the CAR and
SAR models. Journal of Statistical Planning and Inference, 121(2):311–324.
